Exploring Prometheus Metrics and Writing Rollout Queries

Learn how to explore Prometheus metrics and write rollout queries.

Generate some traffic#

We’re about to deploy Prometheus, but before we do, let’s start generating some traffic so that there are metrics we can explore.

First session#

We’ll open a new terminal and create an infinite loop that will send requests to the devops-toolkit app. That way, we’ll have a constant stream of metrics related to requests and responses.

Let’s start by outputting the Istio Gateway host.

echo $ISTIO_HOST

Please copy the output. We’ll need it soon.

Second session#

Open a second terminal session.

Now we can redeclare the ISTIO_HOST variable in the new terminal session. We’ll use it for constructing the address to which we’ll be sending a stream of requests.

Note: Please replace [...] with the output of the ISTIO_HOST variable you copied from the first terminal session.

export ISTIO_HOST=[...]

Now we’re ready to execute the loop.

Commands for Minikube
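The exact commands depend on the platform, but the loop will be along these lines. The address (devops-toolkit.$ISTIO_HOST.nip.io) is an assumption for illustration; use whatever address your devops-toolkit app responds to.

```shell
while true; do
    # Hypothetical address; adjust it to wherever your app is exposed.
    curl -i "http://devops-toolkit.$ISTIO_HOST.nip.io"
    sleep 1
done
```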

We should see a steady stream of responses with 200 OK statuses, with one-second pauses between each one.

Note: If you’re using WSL, you might see errors like sleep: cannot read realtime clock: Invalid argument. If that’s the case, you’ve probably encountered a bug in WSL. The solution is to upgrade Ubuntu to 20.04 or a later version. Stop the loop with “Ctrl+C,” execute the commands that follow, and repeat the while loop command.

Install the requirements

Deploy Prometheus#

Now that we’re sending requests and, through them, generating metrics, we can deploy Prometheus and see them in action.

Please go back to the first terminal session and execute the commands that follow.

Deploy Prometheus
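The commands themselves aren't reproduced here, but a common way to deploy Prometheus is through its Helm chart. The repository, release name, and monitoring namespace below are assumptions, not necessarily what this lesson uses.

```shell
helm repo add prometheus-community \
    https://prometheus-community.github.io/helm-charts

helm upgrade --install prometheus \
    prometheus-community/prometheus \
    --namespace monitoring \
    --create-namespace \
    --wait
```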

Normally, we’d configure the Istio Gateway, NGINX Ingress, or something similar to forward requests to Prometheus based on a domain. However, since Prometheus isn’t the main subject of this chapter, we’ll take the easier route and port-forward to the prometheus-server Deployment. That will allow us to open it through localhost on a specific port.
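A port-forward along those lines might look like the following sketch; the monitoring namespace is an assumption, and 9090 is Prometheus's default port.

```shell
kubectl --namespace monitoring port-forward \
    deployment/prometheus-server 9090 &
```

Since the command runs in the background (note the trailing &), the forwarding keeps working while we continue using the same terminal.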

Monitor the namespace

The output is as follows.

We might need to press the "Enter" key to get back to the terminal prompt.

Now we can open Prometheus in the default browser.

Next, we’ll explore a few metrics and queries. Bear in mind that we aren’t going to do a deep dive into Prometheus. Instead, we’ll focus on creating a query we might use to instruct Argo Rollouts on whether to move forward with a release or roll it back.

If we'd like to automate rollout decision-making, the first step is to define the criteria. One simple yet effective strategy is to measure the error rate of requests. We can use the istio_requests_total metric for that; it provides the total number of requests. Since Prometheus metrics have labels we can use to filter the results, and one of the labels attached to istio_requests_total is response_code, we should be able to distinguish requests that don't fall into the 2xx range.

Please type the query that follows in the "Expression" field.
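The query isn't reproduced here, but based on the description above it is most likely just the bare metric name, which returns the raw counter values:

```
istio_requests_total
```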

Press the "Execute" button and select the "Graph" tab. We should see a graph with requests processed by Istio. Since we only installed Prometheus a few minutes ago and, therefore, it just started collecting metrics, we might want to adjust the timeframe to 5m or another shorter duration.

Prometheus graph with a single metric query

Retrieve raw metrics#

Retrieving raw metrics isn’t very useful by itself, so let's make the query a bit more elaborate.

Instead of querying a metric alone, we should calculate the sum of the rate of requests passing through a specific Service, calculated over a specific interval. We can do that by replacing the existing query with the one that follows.

Calculate the sum of the rate
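A query matching that description might look like the following sketch. The destination_service label value is an assumption; replace it with a pattern that matches your Service.

```
sum(rate(
    istio_requests_total{
        destination_service=~"devops-toolkit.*"
    }[5m]
))
```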

Remember to press the “Execute” button.

That’s still not very useful given that our goal is to see the percentage of errors or, alternatively, the percentage of successful requests. We’ll choose the latter approach, and for that we’ll need a slightly more elaborate query.

We should retrieve the sum of the rate of all successful queries (those in the “2xx” range) and divide it by the sum of the rate of all queries. Both expressions should be limited to a specific Service.

We can accomplish that through the query that follows.

Calculate the sum of the rate again with more conditions
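A sketch of such a query follows. As before, the destination_service value is an assumption, and response_code=~"2.*" matches everything in the 2xx range.

```
sum(rate(
    istio_requests_total{
        destination_service=~"devops-toolkit.*",
        response_code=~"2.*"
    }[5m]
)) /
sum(rate(
    istio_requests_total{
        destination_service=~"devops-toolkit.*"
    }[5m]
))
```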

Type (or copy and paste) that query into the “Expression” field and press the “Execute” button.

Prometheus graph with the percentage of successful requests

This is, finally, a query that produces useful results. The graph is kind of boring, but that’s a good thing. It shows that the results are constantly 1. Since the percentage is expressed as a fraction of 1, it means that 100% of the requests were successful during that whole period. That’s expected because we have a constant loop of requests that are returning response code 200.
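As a quick sanity check of that arithmetic, here's the same division with hypothetical counts: 200 successful requests out of 200 total produce a value of 1.00, i.e., 100%.

```shell
# Hypothetical counts standing in for the two rate sums in the query.
successful=200
total=200

# Success rate as a fraction of 1, like the values plotted in the graph.
awk -v s="$successful" -v t="$total" 'BEGIN { printf "%.2f\n", s / t }'
```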

That’s all we’ll show in Prometheus. Let’s get back to deploying releases using Argo Rollouts. Before we do, though, we won’t need to access the Prometheus UI anymore, so let’s kill the port forwarding process.

pkill kubectl

Note: We'll get hands-on experience with the concepts and commands discussed in this lesson in the project “Hands-on: Applying Progressive Delivery” right after this chapter.
